Graph Neural Network-Based Method of Spatiotemporal Land Cover Mapping Using Satellite Imagery

Multispectral satellite imagery offers a new perspective for spatial modelling, change detection and land cover classification. The increased demand for accurate classification of geographically diverse regions led to advances in object-based methods. A novel spatiotemporal method is presented for object-based land cover classification of satellite imagery using a Graph Neural Network. This paper introduces innovative representation of sequential satellite images as a directed graph by connecting segmented land region through time. The method’s novel modular node classification pipeline utilises the Convolutional Neural Network as a multispectral image feature extraction network, and the Graph Neural Network as a node classification model. To evaluate the performance of the proposed method, we utilised EfficientNetV2-S for feature extraction and the GraphSAGE algorithm with Long Short-Term Memory aggregation for node classification. This innovative application on Sentinel-2 L2A imagery produced complete 4-year intermonthly land cover classification maps for two regions: Graz in Austria, and the region of Portorož, Izola and Koper in Slovenia. The regions were classified with Corine Land Cover classes. In the level 2 classification of the Graz region, the method outperformed the state-of-the-art UNet model, achieving an average F1-score of 0.841 and an accuracy of 0.831, as opposed to UNet’s 0.824 and 0.818, respectively. Similarly, the method demonstrated superior performance over UNet in both regions under the level 1 classification, which contains fewer classes. Individual classes have been classified with accuracies up to 99.17%.


Introduction
Land cover mapping represents spatial information about the different types of the Earth's surface coverage. Historically, one of the most influential causes of land cover changes is urbanisation [1,2]. Temporal observation of land cover changes is necessary for advanced crop [3] and disaster management (e.g., forest fires [4] and landslides [5]). Different land cover types have a heterogeneous regional impact on climate change [6]. This increases the demand for automatic land cover classification of geographically diverse regions further.
Breakthroughs in satellite control systems [7] and the development of techniques for well-organised formation flight above the Earth [8] have expanded the potential applications of satellite technology significantly. Satellite imagery, as a form of Big Data obtained with multispectral scanners, offers new perspectives in using remote sensing for land cover mapping and change detection [9][10][11]. Non-commercial satellites (e.g., Landsat, Sentinel) are a good alternative to commercial satellites for obtaining high spatial resolution imagery [12][13][14].
Several satellite imagery land cover datasets have been created. The Reunion Island dataset [15] consists of a time series of 21 Sentinel-2 images acquired in 2017. A respective pixel-wise ground truth map was built from various sources and completed by field experts [16]. Given the increasing desertification, rapid deforestation, abandonment of In this paper, we present a novel spatiotemporal method for object-based LULC classification of satellite imagery using a GNN. The method starts with a superpixel segmentation, followed by segments being connected through time. Each individual segment is represented as a node in the graph. A target node (i.e., a selected segment) is classified by passing the subgraph with the target node and its neighbourhood through the classification pipeline. In the first step, the pipeline performs the image feature extraction with CNN. Importantly, the selection of the CNN architecture is not constrained to any specific requirements. This provides freedom to use any of the various state-of-the-art architectures, such as Vision Transformer (ViT) [50], EfficientNetV2 [51], EfficientNet [52], ResNet [53], ResNeXt [54], InceptionV3 [55], or DenseNet [56]. Following the extraction of features, the pipeline proceeds to the classification step, utilising a GNN to classify the target node. The choice of a GNN is as versatile as the CNN selection, offering options such as the Graph Attention Network (GAT) [57] or the SAmple and aggreGatE (GraphSAGE) framework [58]. This highlights the flexibility and modularity of the proposed method.
The main contributions of this paper, listed below, introduce novelties that have not been considered by the prior existing state-of-the-art ( [16]): • A new representation of the sequential satellite images as a directed graph by connecting segmented land region through time, based on sequential spatial segment overlaps. • A new land cover mapping method as node classification in the derived directed graph using the GNN. The proposed method allows selection of a target node's neighbourhood, which contains historical temporal context information of connected land region segments. The size of the neighbourhood determines the volume of input spatial and temporal information that the selected GNN uses for node classification. • A modular target node classification pipeline, which offers flexible selection of a CNN for image feature extraction and a GNN for node classification. • The first application of using EfficientNetV2 as a feature extractor for GraphSAGE classification models to perform intermonthly land cover classification. • Complete intermonthly land cover classification maps for the given regions by using Sentinel-2 imagery, as shown in the Section 5.
The rest of this paper is organised as follows. Section 2 reviews related works. Section 3 presents the proposed land cover classification method. Dataset creation to evaluate the proposed method to the state-of-the-art is described in Section 4. The classification results and discussion are given in Section 5. Section 6 concludes this paper.

Related Work
In this section, we review related work in two categories: object-based land cover classification of satellite imagery and GNNs for land cover classification.

Object-Based Land Cover Classification of Satellite Imagery
A classification in hierarchical geospatial databases with a two-step deep learning framework has enabled simultaneous prediction of land use in all hierarchical levels [34]. This method achieved overall accuracies in the order of 90% for two datasets [34]. The main dataset contained satellite images of the settlements of Hameln and Schleswig in Germany with 8 land cover classes. State-of-the-art comparison was performed using the dataset from the ISPRS 2D semantic labelling challenge with satellite images of the settlements of Vaihingen and city of Potsdam in Germany with 6 land cover classes [59].
A multi-level context-guided method with the object-based CNN offers learning of high-level features from spectral patterns, geometric characteristics and object-level contextual information, based on a segmented context patch [32]. This method achieved remarkable accuracy (>80%) compared to traditional methods [32].
A semi-automated object-based land cover classification to observe changes in the Kurdistan region of Iraq using Landsat imagery was proposed [60]. After segmentation of the region, the spatial and spectral information were combined and followed by the object feature extraction. The classification was performed with a nearest neighbour classifier, achieving accuracies in the range of 88% to 89%.
A multi-view object-based classification with a CNN has proved to achieve a higher overall accuracy of 82% compared to application of a CNN on an orthoimage, which achieved an accuracy of 65% [61]. The results highlight that the window-based version of the proposed method can achieve better accuracy than the full object version.
Images from a multispectral camera, LISAT, onboard the third-generation satellite by LAPAN, named LAPAN-A3, were used in a study to perform object-based land cover classification of the Rote Island, using a tree method algorithm [62]. The results achieved overall classification accuracy of 86% for different classes, including open land, settlements, water bodies and diverse vegetation.
The study examined the potential of high-resolution PlanetScope data, made accessible on Google Earth Engine, to enhance land cover classification of Central Brazil, utilising both pixel-based and object-based methods [63]. The results revealed that the exclusive use of PlanetScope data yielded a 67% overall accuracy for pixel-based techniques and 82% for object-based methods. The fusion of PlanetScope data with Sentinel imagery data raised the overall accuracy to 82% for pixel-based and an impressive 91% for object-based methods.
A proposed method for paddy field mapping in Iran used rice phenology, yearly land surface temperature data and multi-temporal Landsat-8 imagery [64]. Enriched by a digital elevation model and spectral indices, the object-based classification outperformed the pixel-based approach with overall accuracy of 94% vs. 92%.

GNNs for Land Cover Classification
A Graph Convolutional Network (GCN) is a variant of CNNs which operates on graphs [65]. The layers in the model learn to encode both the local graph structure and the node features efficiently by performing approximation of the spectral convolution. Temporal GCN (T-GCN) is a combination of GCN and a Gated Recurrent Unit (GRU) to capture spatial and temporal dependencies simultaneously. T-GCN was designed originally for predicting traffic [66].
The Graph Attention Network (GAT) leverages masked self-attentional layers to help in extracting information from features of nodes [57]. The Symmetric Relation based Heterogeneous Graph Attention Network, denoted as SR-HGAT, is an extension of the original GAT network [67]. SR-HGAT takes the high-order relations of heterogeneous graphs into account [67]. A Robust Representation Learning method for multilabel images driven by GATs (RRL-GAT) reduces noisy and false connections between the connected image objects [68].
GraphSAGE (SAmple and aggreGatE) is a framework for generating inductive embeddings by sampling a neighbourhood of a node and then using a learnable aggregator function to aggregate features [58]. The data-efficient algorithm PinSAGE combines highly efficient random walks (improved upon GraphSAGE) and the graph convolutions for processing of graphs with billions of nodes [69].
A U-shaped object graph neural network (U-OGNN) for accurate land cover mapping using high-resolution remote sensing images was proposed [70]. The U-OGNN is composed of a self-adaptive graph construction that employs similarity measures to generate a contextual-aware graph structure. It also features a graph encoder and decoder, which fuse information across various scales, capturing hierarchical features of adjacent objects in images. The method demonstrated the overall accuracy of 88%.
A method for land cover classification, named AF2GNN, was proposed, using hyperspectral images as input to a GNN with adaptive filters and aggregator fusion [49]. Degree-scalers are defined to combine the multiple filters and present the graph structure. This method addresses the common problems of land cover discrimination, noise impact and spatial feature learning. The method achieved overall accuracies above 94% for various datasets.
An Attentive Spatial Temporal Graph CNN exploits the spatial and temporal information in satellite image time series in the context of land cover mapping [16]. After performing the segmentation with the SLIC algorithm [71], the graph is formed by connecting the adjacent segments. Each georeferenced segment or node contains a time series of segments. Target and neighbourhood segments are passed to separate inputs of the classification NN, which utilises the attention mechanism. The NN outputs the classification of the target segment. To the best of our knowledge, this is the first and only spatiotemporal application of GNNs for remote sensing data.
The input graphs into the Attentive Spatial Temporal Graph CNN represent connected segments based on an adjacency in the space without using any kind of temporal information [16]. This method obtained the graphs by segmenting a hand-picked time series image that experts considered suitable. The proposed method in this paper takes shifting and growing of segments into account additionally. The thresholded temporal connections between segments are formed, based on sequential spatial segment overlaps in segmented time series images. The GNN in the proposed method can be replaced by any chosen GNN algorithm, making the proposed method modular, as opposed to the Attentive Spatial Temporal Graph CNN [16], which uses the attention mechanism from GATs exclusively.

Methodology
This paper proposes a novel spatiotemporal method for the LULC classification of satellite imagery using a GNN. The workflow of the proposed method is shown in Figure 1. The proposed method starts by performing superpixel segmentation (step 1 in Figure 1) on each multispectral image in the time series of satellite images TS image of length T. This results in a time series of segmentation masks TS mask . The next step is the graph construction (step 2 in Figure 1), based on temporal segment overlaps, where each segment in each satellite image is represented as a node v. An edge between two nodes is made if two segments in two sequential satellite images overlap sufficiently. The constructed graph G contains the same number of nodes as the number of all superpixels in TS mask . Each v contains the segment representation as a resized multispectral image bounding box, bbox. Next, a subgraph G sub is created for each target node v target based on the input edges towards it (step 3 in Figure 1). The v target represents the selected segment s selected to classify. The G sub is then passed into the target node classification pipeline (step 4 in Figure 1). It performs feature extraction from bbox with any chosen CNN (step 4.1 in Figure 1). This is followed by v target classification with a selected GNN (step 4.2 in Figure 1), which outputs the class probabilities. The land cover class of s selected is determined based on the highest probability. The final result is the time series of the land cover mappings TS cover .
The following subsections provide detailed descriptions of all four main steps of the proposed method.

Superpixel Segmentation
Superpixels are connected groups of similar pixels that form a visually meaningful entity while reducing the number of primitives for subsequent processing. This makes many algorithms computationally feasible on high dimensional images. Pixels are grouped by colour, proximity and other low-level properties [72].
In the proposed method, a multispectral image with C layers is segmented to create meaningful land segments, which surround objects, e.g., buildings, field crops, forests, rivers and roads. The superpixels are calculated with Felzenszwalb's efficient graph based image segmentation [73]. The algorithm has three parameters [73]: • σ-the Standard Deviation of the Gaussian kernel to smooth the image in the preprocessing stage; • k-a scale of observations for the threshold function, which controls the degree of required difference between two adjacent superpixels (a larger k causes a preference for larger superpixels); • min_size-the minimum number of pixels inside the superpixel to control the merging of neighbouring superpixels in the postprocessing stage.    The result of segmenting a single satellite image, captured at time t, is a segmentation mask mask t , which is the t-th mask in TS mask .
The application of Felzenszwalb's image segmentation algorithm is demonstrated in Figure 2. The segmentation masks with k = 70 reveal that large superpixels (i.e., segments) enclose too many meaningful land objects. The results from a small k value contain too many small segments, which leads directly to higher computational cost in the later steps of the method. The value of min_size should be adjusted accordingly, to avoid the presence of the remaining small segments in heterogeneous areas of the image.

Graph Construction
The TS mask is the fundamental for constructing a weighted directed graph G, connecting segments through time. Each i-th segment s from the mask t , labelled as s i,t , is added as a node v i,t in G. The created G contains the same amount of nodes as there are segments in TS mask . An individual v i,t forms a directed edge to v j,t+1 , if s i,t overlaps at least τ (the value range is the (0, 1.0]) portion of the s j,t+1 . The overlap portion serves as the weight w i,j of the formed edge. An example of graph construction is shown in Figure 3.

Segment Representation and Subgraph Construction
An individual v ∈ G holds the corresponding segment's multispectral image bounding box, bbox. It can be extended in width and height to capture additional spatial contextual information around the segment. Due to segments being of different shapes and, consequently, the bounding boxes being of various sizes, the resizing of each bbox must be performed to the width bbox w and height bbox h . The resizing keeps the original aspect ratio of the multispectral image inside the bbox by zero padding. This resizing is needed to ensure that all data inside the nodes are of equal size. It is worth noting that bbox contains a bbox C number of layers.
The target node v target (i.e., the selected segment s selected ) and it's neighbourhood of nodes in G form a subgraph G sub . The parameter T lookback controls the historical temporal context information in G sub by determining the maximum distance of the nodes, based on the input edges towards the v target . By setting the value of T lookback = 0, only the v target at time t will be present in the constructed G sub . By increasing the value of T lookback to, e.g., 2, the G sub will feature the v target at time t, all nodes at time t − 1 with directed edges towards the v target , and all nodes at time t − 2 with edges directed towards all previously included nodes in time t − 1.
The largest subgraphs can be constructed for the leaf nodes of G, which represent segments obtained for the final image in TS satellite_image at time t = T. Even though T lookback can be of any number greater than or equal to 0, each G sub can contain nodes and edges only up to the first image at t = 0. An important fact is that subgraphs of all target nodes in t = 0 actually contain only the v target itself, because the nodes at time t = 0 do not contain any input edges. The G sub construction is demonstrated in Figure 4. All the visualised subgraphs have bounding boxes drawn around the included nodes.

Node Classification Pipeline
The G sub serves as the input to the target node classification pipeline, which outputs the classification probabilities of the s selected , represented by the v target . The node classification pipeline is visualised in Figure 5. The first step is the feature extraction, which is performed ∀v ∈ G sub (step 1 in Figure 5). Due to the bbox inside each v containing a bbox C number of layers and the majority of state-of-the-art CNNs requiring a 3 channel input, the bbox is first passed through the 2D convolution with 3 kernels of size 1×1 and a stride of 1. This converts the depth of the input from bbox C to 3. The 3 feature maps are then passed to the CNN. It extracts F features in the form of a feature vector, which is then stored inside the v. The value of F depends on the output size of the last feature extraction layer of the CNN (the softmax classification layer at the end of NN is removed). For majority of established CNNs, the number of output features from last feature extraction layer cannot be adjusted without modifying previous layers of the NN. After feature extraction, the G sub is passed as the input to the GNN (step 2 in Figure 5). It contains the same number of layers as the value of T lookback (the only exception being the case of T lookback = 0 with GNN containing a single layer). The GNN outputs N class probabilities of v target . The class of s selected is determined as the class with the highest probability. After the classification of ∀v target ∈ G, the time series is created of land cover mappings TS cover .

Dataset Preparation
Datasets with LULC classification of satellite imagery on a daily/weekly/monthly basis for a multi-year period of time are not available. This Section describes creation of an intermonthly dataset to evaluate and compare the performance of the proposed method to the state-of-the-art. The first Subsection provides details about acquisition of Sentinel-2 images of the selected regions. The second Subsection describes the process of creating land cover ground truth.

Intermonthly Satellite Imagery Acquisition
To create the land cover dataset, the first step was to acquire atmospherically corrected Sentinel-2 L2A images, with the earliest globally available being from 2017. From 1992 to 2015 the conversion of cropland to artificial surfaces was notably high in Austria and Slovenia, with rates above 4% and 3% respectively, exceeding the average rate of above 2% observed across the European Union (EU28). In the same period, Austria and Slovenia also underwent a conversion of (semi-)natural vegetated areas to cropland at rates above 2% and nearly 6% respectively, with the EU28 average just under 3% (https://www.oecd.org/ environment/indicators-modelling-outlooks/monitoring-land-cover-change.htm (accessed on 12 July 2023)). Given these trends, a region of Graz in Austria, and the region of the cities-Portorož, Izola and Koper-in Slovenia were selected, due to their spatial heterogeneity.
Images were obtained for the two selected regions: Graz, Austria, encapsulated within the WGS84 bounding box [ Each image stores data as unsigned 16-bit integers. The images for the Graz region have the size of 946×825 pixels, corresponding to an area of 78 km 2 at a 10 m spatial resolution, while those for the region of Portorož, Izola and Koper are 1212×508 pixels in size, representing an area of 61 km 2 at the same spatial resolution. All images contain 13 basic spectral layers (B01-B12) and 4 calculated layers: Normalised Difference Vegetation Index (NDVI), Normalised Difference Moisture Index (NDMI), Normalised Difference Water Index (NDWI) and Normalised-Difference Snow Index (NDSI). The spatial resolution of layers ranged from 10 m to 60 m. The described TS image for both regions are visualised in Figure 6.

Land Cover Ground Truth Creation
The next step of dataset creation was the superpixel segmentation. Segmentation masks serve as the basis for creation of the time series of ground truth land cover mappings TS gt_cover . An individual image was segmented, based on the normalised values of 13 basic spectral layers. The Felzenszwalb's segmentation algorithm used the parameters σ = 0.5, k = 12 and min_size = 15 pixels. These values were selected empirically, to ensure a balanced tradeoff between segments being small enough to capture complex land objects and the total number of segments not exceeding the storage limitations of hardware that we used for deep learning.
The created TS mask for the region of Graz contains 224,665 segments (avg. 5616 segments/mask). Similarly, the TS mask for the region of Portorož, Izola and Koper has 122,685 segments (avg. 2992 segments/mask). Examples of the segmentation results are visible in Figure 7. The last step of dataset creation was the ground truth land cover labelling. For each dataset region, an individual s of ∀mask ∈ TS mask was labelled with a single class among 15 LULC classes in CLC2018 level 2 [18]. To speed up the labelling process, we utilised a pretrained UNet model by Esri (https://www.arcgis.com/home/item.html?id=afd12484 4ba84da69c2c533d4af10a58 (accessed on 12 July 2023)), which achieved an overall pixelbased accuracy of 84.0% for CLC level 2 classification. A multispectral image with 13 basic spectral layers served as the input to the pretrained model. Before passing the image to the UNet model, each layer had to be normalised and standardised, based on provided pretrained model statistics. The last step of preprocessing was scaling of the input image to 1280 × 1280 pixels (nonsquare images were zero padded). The input image was then passed into the UNet model, which output a tensor of size 1280 × 1280 × 15 (15 CLC level 2 classes). The output tensor was reshaped into a size of 1280 × 1280 × 1 by performing the argmax function across the 3rd axis. The classification tensor was then reshaped to match the size of the original image by downscaling and removal of the padding. The classification tensor and respective mask were then used to create a ground truth image. This was performed automatically by selecting each individual s from mask and assigning it the majority class of associated pixels from the classification tensor. The automatically created ground truth image was finalised with manual label corrections of segments. This was necessary due to rare misclassifications from the UNet model. By performing the described process on all images in individual TS image , the time series was created of the ground truth land cover mappings TS gt_cover .
The TS gt_cover for the region of Graz contains 12 of the 15 CLC level 2 classes, excluding Inland Wetlands, Maritime wetlands, and Marine waters; meanwhile, the TS gt_cover for the region of Portorož, Izola and Koper includes 13 out of the 15 classes, omitting Mine, dump and construction sites, and Inland wetlands from the selected region. Examples of classification output, obtained with the UNet model, and manually corrected time series of the ground truth TS gt_cover , are visualised in Figure 8. For each region, the TS image was split into TS train_image and TS test_image . Specifically, the Graz region's TS image was split into a TS train_image that containes 29 consecutive images (accounting for 72.5% of all images) from January 2017 to April 2020, and a TS test_image with 11 consecutive images (making up 27.5% of all images) from May 2020 to July 2021. Similarly, the TS image for the region of Portorož, Izola and Koper was split into a TS train_image , containing 29 consecutive images (70.7% of all images) from January 2017 to October 2019, and a TS test_image , containing 12 consecutive images (29.3% of all images) from January 2020 to June 2021. The split ratio was decided upon by the fact that each TS test_image should contain images across more than a year. The TS gt_cover and TS mask of each region were split in the same manner into TS train_gt_cover and TS test_gt_cover , as well as TS train_mask and TS test_mask . The distribution of ground truth land cover labels of pixels in individual TS train_gt_cover and TS test_gt_cover are shown in Figure 9. Note that TS test_gt_cover for the region of Graz lacks pixels with land cover classification of Permanent crops and Open spaces with little or no vegetation.

Results and Discussion
The proposed method was evaluated on the created dataset, with its performance compared against that of the current state-of-the-art. The first Subsection describes the application of the proposed method on the created dataset. The second Subsection describes the analysis of GNN usage in the proposed method, while the classification performance and discussion are presented in the third Subsection.

Application of the Proposed Method and Experimental Parameters
From the dataset creation process, the TS train_mask and TS test_mask of both regions were carried over. For each individual region in the dataset, the TS train_image , TS train_mask and TS train_gt_cover were used for constructing the training graph G train , with TS test_image , TS test_mask and TS test_gt_cover being used to construct the test graph G test . The construction of both graphs for individual region was performed using the empirically set parameter τ = 0.2. This value ensured that the number of inbound edges per node was limited to 5.
For the region of Graz, the constructed G train contained 163,414 nodes, and the G test had 61,251 nodes. Similarly, for the region of Portorož, Izola and Koper, the G train had 86,462 nodes and the G test 36,223 nodes. The distribution of ground truth land cover classification of nodes in G train and G test for both dataset regions is shown in Figure 10.  For the classification of individual region within the created dataset, we used the following set of experimental parameters: • Individual v contained the bbox, which was extended by 20 pixels in both width and height. The bbox was resized to the size of bbox w = 48 and bbox h = 48. It also contained bbox C = 7 layers, specifically, B04, B03, B02, NDVI, NDMI, NDWI and NDSI. • The feature extraction part of the target node classification pipeline starts by passing the segment's bbox through the trainable 2D convolution with 3 kernels. This outputs a tensor with 3 feature maps. These are then passed into the CNN -the state-of-the-art EfficientNetV2-S [51] was selected. It had the fully-connected output layer removed and all the layers trainable. Before passing the 3 feature maps into the EfficientNetV2-S, they were passed through the EfficientNetV2-S's preprocessing transforms. The final output of the EfficientNetV2-S was F = 1280 extracted features. • Each but the last layer in the GNN had a hidden dimension of 256, with a dropout of 0.5. The output dimension of the final layer in the GNN was set to match the number of classification classes N. Specifically, it was set to 12 for the Graz region and 13 for the region of Portorož, Izola and Koper. All the GNN layers were trainable. • Before training, the EfficientNetV2-S model for feature extraction was initialised with ImageNet pretrained weights, because the preliminary experiments showed that this led to lower training loss. The classification pipeline forms a single model and it was, same as in [58], trained in a single training process using the Adam optimiser [74]. The initial learning rate was set to 0.001 and the model was trained for 10 epochs with a random shuffle of the training samples (i.e., subgraphs) inbetween. The batch size was 30. The loss function was categorical cross-entropy, which used class weights, due to unbalanced ground truth land cover labels of the nodes in G train . The edge weights in ∀G sub were ignored during the message passing through the GNN, because initial empirical experiments had shown that that led to lower training loss.
Regarding the experimental protocol, for each region in the dataset the experiments were performed for multiple values of T lookback . A total of 10 runs were performed per experiment.
Overall accuracy reflects correct classification for all classes combined, while the classspecific accuracy assesses the correctness for each individual class independently [75]. The equations for the i-th class accuracy and overall accuracy are defined with (1) [76]. The equations contain TP i , TN i , FP i and FN i , to denote True Positives (the number of samples identified correctly as positive), True Negatives (the number of samples identified correctly as negative), False Positives (the number of samples identified incorrectly as positive) and False Negatives (the number of samples identified incorrectly as negative) for the i-th class, respectively. Recall (or sensitivity) quantifies how many actual positive instances were identified correctly [76]. Precision measures the proportion of true positive predictions among all the positive predictions [76]. The F1-score is the harmonic mean of precision and recall, providing a balance between precision and recall, making it especially useful for imbalanced datasets [76]. All three class-specific metrics are defined with (2) [75]. The metrics can be micro-averaged (all samples contribute equally), macro-averaged (all classes constribute equally) and weighted-averaged (the contribution of each class is weighted by the proportion of its instances in the total data, labelled as w i ). The equations for micro-and macro-averaged recall, precision and F1-score are defined with (3), (4) and (5), respectively [75]. The weighted-averaged variants of the metrics are defined with (6) and (7) [75]. The choice between micro-, macro-, and weighted-averaging depends on the balance of the dataset and the problem that is being solved. Micro-averaging is the preferred approach if every sample in the dataset holds equal importance. On the other hand, macro-averaging is used when the significance is distributed uniformly among all the classes. Finally, weighted-averaging should be used with imbalanced datasets where the relevance of classes is defined by their size [75]. Given the significant imbalance in the created dataset, as previously shown in Figure 10, the weighted F1-score alongside accuracy were used as the primary evaluation metrics.

Analysis of GNN Usage in the Proposed Method
In order to choose the best performing GNN for the proposed method's target node classification pipeline, the experiments were performed on a subset of the created dataset. More precisely, only nodes in G train from the first seven satellite images (from January 2017 to August 2017) were used to train classification models, to classify the G test of the region of Portorož, Izola and Koper. The experimental values for T lookback were 0, 1, 2, 3 . The average accuracy and weighted F1-score were calculated for the CLC classification at class level 2 considering 10 runs. Two GNNs were considered for this evaluation: • The GraphSAGE algorithm [58], which applied LSTM aggregation in each of its layers. The output of the hidden layers was passed through the rectified linear unit (ReLU) activation function, as it was done in [58]. • GAT [57] with outputs of hidden layers being passed through the exponential linear unit (ELU) activation function.
In addition, we also examined the effect of eliminating the GNN from the target node classification pipeline. In this case, global average pooling was performed for each individual extracted feature across all nodes in the subgraph G sub . More specifically, feature vectors from all nodes in each G sub were stacked in a tensor with 1280 columns. Each individual feature then underwent global average pooling, which resulted in a tensor of size 1 × 1280. Following this process, an output fully connected layer was utilised, receiving the tensor as input and yielding the output classification tensor with size equal to the number of classes N. Importantly, when T lookback = 0 and the GNN is not used, the whole classification process essentially becomes an image classification task using EfficientNetV2-S. The experiments of excluding a GNN from the target node classification pipeline were conducted, to determine the necessity of using a GNN in the proposed method to achieve high overall classification performance.
The results displayed in Figure 11 demonstrate how both the choice of GNN and the value of T lookback affect the performance of the proposed method for the CLC level 2 classification. Based on the obtained results, utilising GraphSAGE as the GNN in the target node classification pipeline yielded the highest performance for T lookback values of 0, 2, and 3. Specifically, these values resulted in average accuracies of 0.625, 0.628 and 0.615, and average weighted F1-scores of 0.630, 0.634 and 0.622, respectively. However, an exception was observed at a T lookback value of 1, where GraphSAGE did not exhibit the best performance, resulting in a decrease in average accuracy to 0.594 and weighted F1-score to 0.612. At that specific value of T lookback , the GAT outperformed GraphSAGE, achieving an average accuracy of 0.635 and a weighted F1-score of 0.643. Notably, there was a more significant decline in performance for GAT compared to GraphSAGE, particularly at a T lookback value of 3. This observation suggests that GAT is more sensitive to higher values of T lookback , resulting in a larger drop in classification performance.
Removal of the GNN entirely from the proposed method led to the worst performance across all tested T lookback values. However, as the value of T lookback increased, there was a noticeable improvement in the results. The average accuracy improved from 0.334 to 0.534, while the average weighted F1-score increased from 0.360 to 0.542. These results demonstrate the necessity of utilising a GNN to achieve high classification performance with the proposed method. GraphSAGE demonstrated its superiority as the best-performing GNN.

Classification Performance and Discussion
Based on the results from the previous subsection, the GraphSAGE algorithm, utilising LSTM aggregation, showcased the highest performance among all the evaluated GNNs. Hence, it was chosen as the GNN within the target node classification pipeline for the complete performance evaluation of the proposed method, using both regions in the dataset. Unlike with training in the prior subsection, all nodes in each G train were utilised to train the classification models. The experiments were performed for T lookback = 0, 1, . . . , 7 for each region in the dataset. The average and Standard Deviation of accuracy, precision, recall and F1-score were calculated for the CLC classification at both class levels 1 and 2 considering all runs. The level 1 classification results were derived by using the level 2 results and combining them according to the predefined class structure of level 1, as defined in the CLC dataset [18].
The training for both regions was undertaken on two distinct platforms. For the region of Graz, a dedicated Deep Learning server was used, equipped with two Intel Xeon E5-2683 v4 processors, 98GB of RAM, and four NVIDIA GeForce GTX TITAN X graphics cards. Here, the average epoch training time varied from 18 min for T lookback = 0 to 64 min for T lookback = 7. The training times for 10 runs at T lookback = 0, 1, . . . , 7 were 30, 40,41,44,56,70,88 and 107 h, respectively. The total training time was 476 h.
In contrast, the region of Portorož, Izola and Koper was trained on a personal computer powered by an AMD Ryzen 9 7900X processor, 64GB of RAM and an NVIDIA GeForce RTX 4090 graphics card. The average epoch training time was shorter, from 2 min at T lookback = 0, extending to 12 min for T lookback = 7. The accumulated training times for 10 runs at T lookback = 0, 1, . . . , 7 were reduced considerably, specifically to 3, 4, 6, 8, 10, 13, 17 and 21 h, respectively. This sums up to a total of 82 h of training time.
The pixel-based CLC level 2 classification results of the TS test_image , for both dataset regions, are depicted using various metrics in Tables 1 and 2. For the Graz region, the Esri's UNet model achieved an accuracy of 0.818, weighted precision of 0.841, weighted recall of 0.818 and weighted F1-score of 0.824. The proposed GNN-based method achieved better performance across all metrics, but at various sizes of subgraphs. By classifying the subgraphs formed with T lookback = 2, an accuracy of 0.831 was achieved. The highest weighted precision of 0.865 was achieved by classifying the subgraphs formed with T lookback = 1. Classifying the subgraphs formed with T lookback = 2 led to the highest weighted recall of 0.831 and weighted F1-score of 0.841. The highest macro-and micro-averaged values for precision (0.442 and 0.831) and F1-score (0.468 and 0.831) were all achived with subgraphs formed with T lookback = 2. The same applies for the micro recall (0.831), however, the highest macro recall value of 0.552 was achieved at T lookback = 0. Increasing the value of T lookback above 2 led to gradual worsening of the results across all the observed metrics, both in terms of lower averages and higher Standard Deviations.
As for the region of Portorož, Izola and Koper, Esri's UNet model performed better than the proposed GNN-based method based on every metric. The UNet model's performance statistics were: accuracy-0.792, weighted precision-0.827, macro precision-0.578, and micro precision-0.792. Equally, it demonstrated a consistent weighted and micro recall of 0.792, with a lower macro recall of 0.677. The F1-scores were 0.801 (weighted), 0.589 (macro), and 0.792 (micro). The proposed method achieved the best results when classification was performed of subgraphs formed with T lookback = 0. This resulted in an average overall accuracy and all micro measures (precision, recall, and F1-score) of 0.741. The weighted metrics were as follow: precision-0.770, recall-0.741, and F1-score-0.746. In contrast, the macro measures were quite lower, with precision at 0.520, recall at 0.577, and F1-score at 0.529. Similar to the Graz region, an overall decline in performance across all metrics was observed, beginning immediately after the T lookback value exceeded 0.  The only exception was macro precision, where the UNet model scored higher (0.862) than the proposed method using subgraphs formed with T lookback = 0 (0.847). Increasing the value of T lookback beyond 1 led to a decline in the performance of the proposed method, compared to Esri's UNet model, for the majority of the evaluation metrics. Table 3. Accuracy, precision, recall and F1-score for CLC level 1 classification of TS test_image for a region of Graz. The best result for each metric is marked in bold.  Table 4. Accuracy, precision, recall and F1-score for CLC level 1 classification of TS test_image for a region of Portorož, Izola and Koper. The best result for each metric is marked in bold. For the CLC level 2 classification of the region of Graz, the models of the proposed method, formed with T lookback = 2 (using spatiotemporal information from up to two previous multispectral images), secured the highest average weighted F1-score of 0.841. The model with the best F1-score of 0.846 was selected to create the confusion matrix for the classification of CLC class level 2, as depicted in Figure 12a. For the CLC level 1 classification, the highest average weighted F1-score of 0.888 was achieved by models utilising T lookback = 1, employing information from up to one prior multispectral image. The model boasting the top F1-score of 0.900 was used to generate the confusion matrix for CLC class level 1, showcased in Figure 12b. These confusion matrices provide insights into per-class accuracy and misclassifications exhibited by the proposed GNN-based method.
The results in the confusion matrix in Figure 12a highlight the connection between the accuracy and the number of training samples per class. High classification accuracies were achieved for classes with a large number of training samples, namely, Urban fabric (87.99% with 47,341 subgraphs), Industrial, commercial and transport units (85.98% with 24,658 subgraphs), Arable land (78.53% with 55,852 subgraphs), Forests (90.03% with 25,037 subgraphs) and Inland waters (98.66% with 4834 subgraphs). The only exception to this pattern was Artificial, non-agricultural vegetated areas with an accuracy of 83.15% and 821 training subgraphs, which is considerably less compared to other classes with high classification accuracy. The classes with a smaller number of training samples were classified poorly. Namely, Scrub and/or herbaceous vegetation associations achieved a classification accuracy of 0% by being wrongly mistaken in 68.80% of pixels for a Forest, and also to a lesser extent for other similar land cover classes. The poor accuracy results were also achieved for Mine, dump and construction sites (44.17%), Pastures (8.11%) and Heterogeneous agricultural areas (41.65%).
As for the confusion matrix for CLC level 1 classification in Figure 12b, the best performing model of the proposed GNN-based method achieved accuracy of 91.69% for Artificial surfaces, 87.28% for Agricultural areas, 90.84% for Forest and seminatural areas and 96.90% for Water bodies. The most common misclassifications occurred between Artificial surfaces and Agricultural areas.  For the CLC level 2 classification of the Portorož, Izola and Koper region, the proposed method's models, using T lookback = 0 (i.e., without using any spatiotemporal information from previous multispectral images), achieved the highest average weighted F1-score of 0.764. The top-performing model, boasting an F1-score of 0.764, was utilised to construct the confusion matrix for CLC level 2 classification, as illustrated in Figure 13a. As for the CLC level 1 classification, the models with T lookback = 1 (which used data from one preceding multispectral image) obtained the highest average weighted F1-score of 0.872. The leading model, with an F1-score of 0.889, was employed to develop the confusion matrix for CLC level 1 classification, presented in Figure 13b. The confusion matrix for CLC level 2 classification in Figure 13a illustrates the good performance of the proposed GNN-based method for achieving accuracy above 90% in classifying Industrial, commercial, and transport units (90.80%), Maritime wetlands (96.82%), Inland waters (98.55%), and Marine waters (99.17%). The model also performed well for Urban fabric (84.69%) and Forests (79.83%). Just as in the case of the region of Graz, the results for classes with fewer training samples (subgraphs) were suboptimal. For example, the Open spaces with little or no vegetation class, having 22 subgraphs, attained an overall classification accuracy of 6.63%. Similarly, the Pastures class, with a mere 48 training subgraphs, achieved a classification accuracy of just 1.97%.
The top-performing model of the proposed GNN-based method, as depicted in the CLC level 1 classification confusion matrix in Figure 13b, attained accuracies of 89.66% for Artificial surfaces, 87.04% for Agricultural areas, 76.02% for Forest and seminatural areas, 97.88% for Wetlands and 99.15% for Water bodies. The most common misclassifications were observed between Agricultural areas and Forest and seminatural areas, a trend observed previously in the more detailed CLC level 2 classification confusion matrix in Figure 13a.
For both regions of the dataset, the best proposed method's models were used to calculate individual weighted F1-scores for CLC level 2 classification of all images in the TS test_image , as illustrated in Figure 14. For the region of Graz these scores ranged from 0.813 to 0.889, averaging at 0.860. However, the range was wider for the region of Portorož, Izola and Koper, with scores as low as 0.65 to as high as 0.801, and an average score of 0.739. The variance in these scores for the region of Portorož, Izola and Koper was nearly double (0.0015) that of the Graz region (0.0008).
The pixel-based heatmaps, depicted in Figure 15, illustrate the spatial distribution of cumulative inaccuracies in the CLC level 2 classification for both dataset regions. A subset of these land cover predictions is depicted in Figure 16. Notably, for the Graz region, the urban and industrial areas experienced the highest frequency of misclassifications, followed closely by agricultural zones. In the region of Portorož, Izola and Koper, the seminatural and agricultural landscapes were the most common sites of classification errors.
The findings demonstrate that the proposed GNN-based method outperformed Esri's UNet model consistently across all datasets for CLC level 1 classification. However, when it comes to CLC level 2 classification, the GNN-based method's performance was on par with that of Esri's UNet model. While the proposed method achieved better results across all evaluation metrics in the Graz region, it underperformed in the region of Portorož, Izola and Koper.  Various factors contribute to misclassifications and performance degradation as the value of T lookback increases. This is evident in Figure 17, which displays statistics of subgraphs for both dataset regions. For the region of Graz, a decline in average accuracy and weighted F1-score begins when T lookback equals 3. This decline aligns with the average cumulative spatial coverage of segments in both training and testing subgraphs exceeding ≈0.07 km 2 . The region of Portorož, Izola and Koper displays a different pattern. The average cumulative spatial coverage of segments in the subgraphs are significantly larger than in Graz for all the tested T lookback values. The achieved metric scores start lower in this region in comparison to Graz. The degradation begins immediately as we increase T lookback above zero. In the region of Portorož, Izola and Koper, it is evident that, as the number of nodes in G sub increases, there is a growing disparity in the average cumulative spatial coverage of segments between the training and testing subgraphs. While this disparity could potentially cause a decline in performance, it's worth noting that such a trend was not observed in the Graz region. The number of nodes in a G sub doesn't influence classification performance directly. However, it does control spatial coverage, which affects classification performance significantly. Hence, the primary cause of poor classification performance are overly large spatial regions captured by larger subgraphs through time. Such expansive regions can contain conflicting segment information, resulting in misclassifications. This issue is mainly a result of segmentation parameters, although an increase in the value of T lookback also plays a contributing role. The described issue with conflicting segment information is particularly evident for classes like Industrial, commercial and transport units, Mine, dump and construction sites, as well as Arable land, Pastures, and Heterogeneous agricultural areas, as visualised by the confusion matrices in Figures 12a and 13a.
Dataset imbalance impacts the class-specific accuracies for both dataset regions, as demonstrated in Figure 18. The trend lines suggest higher accuracies are associated with classes that occupy a larger proportion of node labels within the G train . However, several classes have emerged as exceptions to this rule. For instance, in the Graz region, only 2.96% of nodes in the G train are labelled as Inland waters, but this still results in an accuracy of 98.66%. To achieve better performance, complex classes such as Pastures and Open spaces with little or no vegetation require a greater quantity of nodes with corresponding labels.
The proposed method faces certain limitations. Firstly, the segmentation parameters should be selected carefully, because small segments result in G containing a large set of nodes. This leads to the increase of the processing time required to classify all target nodes within the G. Secondly, a smaller value of the τ parameter during graph construction leads to more edges being established between nodes, consequently enlarging the G sub of each individual v target . Larger G sub can potentially capture a too broad spatial region through time, and may introduce conflicting information from the segments, thereby undermining the classification results. Additionally, larger subgraphs result in slower training and extended inference times, which limits the efficiency of the proposed GNN-based method.

Conclusions
This paper presented a novel spatiotemporal method for an object-based land cover classification of satellite imagery using a GNN. A 4-year intermonthly land cover ground truth dataset of Sentinel-2 imagery was created for two regions: Graz in Austria, and the region of Portorož, Izola and Koper in Slovenia. This newly created dataset, classified in accordance with the CLC level 2 nomenclature, was utilised to evaluate the performance of the proposed method. For testing purposes, the method's classification pipeline used EfficientNetV2-S for image feature extraction, and the GraphSAGE algorithm with LSTM aggregation for target node classification.
For the CLC level 2 classification of the region of Graz, the proposed method outperformed the state-of-the-art UNet model across all evaluation metrics. The proposed method yielded the best average weighted F1-score of 0.841 and an accuracy of 0.831, which surpassed the UNet model's results of 0.824 and 0.818, respectively. For the classification of the region of Portorož, Izola and Koper, the UNet model achieved better results compared to the proposed method. The proposed method achieved a peak average weighted F1-score of 0.746 and accuracy of 0.741, underperforming the UNet model's results of 0.801 and 0.792, respectively.
However, by combining the CLC level 2 into the predefined CLC level 1 classification, the proposed method demonstrated far superior performance for both regions. The proposed method yielded the best average weighted F1-score and accuracy of 0.888 for the Graz region, outperforming the UNet model's performance, which achieved a score of 0.857 for both metrics. Similarly, for the region of Portorož, Izola and Koper, the proposed method achieved the highest average weighted F1-score of 0.872 and accuracy of 0.871, beating the UNet model, which scored 0.862 and 0.864, respectively.
The findings indicate that the classification performance of the proposed GNN-method starts to decline when an excessive amount of spatiotemporal information from numerous preceding multispectral images is included in the subgraphs. The proposed method exhibits limitations related to the choice of segmentation and graph construction parameters.
In the future, the proposed method's parameter of maximum distance of nodes based on input edges towards the target node will become adaptable using the multispectral data. The proposed method will also be extended with additional types of input data, and modified to perform object-based regression tasks, e.g., particulate matter trends.

Data Availability Statement:
The data presented in this paper are available on request from the corresponding author. The data are not publicly available due to ongoing analysis and publication commitments.

Conflicts of Interest:
The authors declare no conflict of interest.